Python - serialization with pickle and json
Serialization
- Serialization is the process of converting a data structure or object state into a format that can be stored (for example, in a file or memory buffer) or transmitted (for example, across a network connection) and reconstructed later in the same or another computer environment.
- When the resulting series of bits is reread according to the serialization format, it can be used to create a semantically identical clone of the original object.
- This process of serializing an object is also called deflating or marshalling an object. The opposite operation, extracting a data structure from a series of bytes, is deserialization (which is also called inflating or unmarshalling).
- In Python, we have the pickle module. The bulk of the pickle module is written in C, like the Python interpreter itself. It can store arbitrarily complex Python data structures. It is a cross-version, customisable, but unsafe serialization format: it is not secure against erroneous or malicious data, so we should never unpickle data from a source we do not trust.
- The standard library also includes modules serializing to standard data formats:
- json with built-in support for basic scalar and collection types and able to support arbitrary types via encoding and decoding hooks.
- plistlib, for XML-encoded property lists, limited to plist-supported types (numbers, strings, booleans, tuples, lists, dictionaries, datetime objects, and binary blobs).
Pickle
What data type can pickle store?
- Here are the things that the pickle module can store:
- All the native datatypes that Python supports: booleans, integers, floating point numbers, complex numbers, strings, bytes objects, byte arrays, and None.
- Lists, tuples, dictionaries, and sets containing any combination of native datatypes.
- Lists, tuples, dictionaries, and sets containing any combination of lists, tuples, dictionaries, and sets containing any combination of native datatypes (and so on, to the maximum nesting level that Python supports).
- Functions, classes, and instances of classes (with caveats).
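- As a rough check of the list above, the following sketch round-trips a nested structure through pickle and confirms that the container types survive. The dictionary contents are made up for illustration; functions and classes are left out because they come with the caveats mentioned above.

```python
import pickle

# Illustrative data only: native types nested a few levels deep.
data = {
    "flag": True,
    "count": 42,
    "ratio": 3.14,
    "z": 2 + 3j,
    "text": "héllo",
    "raw": b"\x00\x01",
    "buf": bytearray(b"ab"),
    "nothing": None,
    "nested": [(1, 2), {3, 4}, {"k": [5, {"deep": (6,)}]}],
}

# Serialize to bytes and immediately deserialize again.
restored = pickle.loads(pickle.dumps(data))

print(restored == data)                      # True
print(type(restored["nested"][0]).__name__)  # tuple
print(type(restored["nested"][1]).__name__)  # set
print(type(restored["buf"]).__name__)        # bytearray
```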
Constructing Pickle data
- We will use two Python Shells, 'A' & 'B':
- Open another Shell:
- Here is the dictionary type data for Shell 'A':
- The time module contains a data structure, struct_time to represent a point in time and functions to manipulate time structs.
- The strptime() function takes a formatted string and converts it to a struct_time.
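- A minimal sketch of such a dictionary follows. The field names and values are hypothetical, chosen only to mix several datatypes; the one piece taken from the text is the use of time.strptime() to build a struct_time.

```python
import time

# Hypothetical book record; the keys and values are illustrative only.
book = {
    "title": "An Example Book",
    "page_count": 320,
    "tags": ("python", "serialization"),
    "in_print": True,
    "cover_image": b"\x89PNG",
}

# strptime() parses a string according to a format and returns a time.struct_time.
book["published_date"] = time.strptime("2014-03-07", "%Y-%m-%d")

print(type(book["published_date"]).__name__)  # struct_time
print(book["published_date"].tm_year)         # 2014
```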
Saving data as a pickle file
- Now we have a dictionary that holds all the information about the book. Let's save it as a pickle file.
- We set the file mode to wb to open the file for writing in binary mode. Wrap it in a with statement to ensure the file is closed automatically when we're done with it. The dump() function in the pickle module takes a serializable Python data structure, serializes it into a binary, Python-specific format using the default version of the pickle protocol (pickle.DEFAULT_PROTOCOL), and saves it to an open file.
- The pickle module takes a Python data structure and saves it to a file.
- It serializes the data structure using a data format called the pickle protocol.
- The pickle protocol is Python-specific; there is no guarantee of cross-language compatibility.
- Not every Python data structure can be serialized by the pickle module. The pickle protocol has changed several times as new data types have been added to the Python language, but there are still limitations.
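- So, there is no guarantee of compatibility in both directions between different versions of Python itself: a newer Python can read pickles written with an older protocol, but an older Python may not be able to read a pickle written with a newer protocol.
- Unless we specify otherwise, the functions in the pickle module use the default protocol version (pickle.DEFAULT_PROTOCOL), which is not necessarily the highest one the interpreter supports (pickle.HIGHEST_PROTOCOL).
- The pickle protocol is a binary format. Be sure to open our pickle files in binary mode; in Python 3, pickle.dump() raises a TypeError if the file was opened in text mode.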
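- The steps above can be sketched as follows. The book dictionary is a stand-in, and the file is written to a temporary directory (an assumption made so the sketch leaves nothing behind); in the shell session it would simply be book.pickle in the current directory.

```python
import os
import pickle
import tempfile

# Stand-in data; any picklable structure would do.
book = {"title": "An Example Book", "page_count": 320, "tags": ("python", "pickle")}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "book.pickle")

    # "wb" = write, binary. dump() uses pickle.DEFAULT_PROTOCOL unless told otherwise.
    with open(path, "wb") as f:
        pickle.dump(book, f)
    default_size = os.path.getsize(path)

    # An explicit protocol can be requested. HIGHEST_PROTOCOL is the newest one this
    # interpreter supports; older Python versions may not be able to read it.
    with open(path, "wb") as f:
        pickle.dump(book, f, protocol=pickle.HIGHEST_PROTOCOL)
    highest_size = os.path.getsize(path)

print(pickle.DEFAULT_PROTOCOL, pickle.HIGHEST_PROTOCOL)
print(default_size, highest_size)
```

- Pinning the protocol argument to a fixed number is one way to keep the files readable by an older Python version that still has to load them.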
Loading data from a pickle file
- Let's load the saved data from the pickle file in the other Python shell, B.
- There is no book variable defined here, because we defined book in Python Shell A, not in Shell B.
- We opened the book.pickle file we created in Python Shell A. The pickle module uses a binary data format, so we should always open pickle files in binary mode.
- The pickle.load() function takes a stream object, reads the serialized data from the stream, creates a new Python object, recreates the serialized data in the new Python object, and returns the new Python object.
- The pickle.dump()/pickle.load() cycle creates a new data structure that is equal to the original data structure.
- Let's switch back to Python Shell A.
- We opened the book.pickle file, and loaded the serialized data into a new variable, book2.
- The two dictionaries, book and book2, are equal.
- We serialized this dictionary, stored it in the book.pickle file, then read the serialized data back from that file and created a perfect replica of the original data structure.
- Equality is not the same as identity. book2 is a perfect replica of book, so book == book2 is true, but it is still a copy, so book is book2 is false.
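- A compact sketch of the whole dump()/load() cycle, run in a single process rather than two shells. The dictionary and the temporary directory are assumptions made for the example; the names book and book2 follow the text.

```python
import os
import pickle
import tempfile

book = {"title": "An Example Book", "page_count": 320, "tags": ("python", "pickle")}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "book.pickle")

    # Write the pickle in binary mode...
    with open(path, "wb") as f:
        pickle.dump(book, f)

    # ...and read it back, also in binary mode.
    with open(path, "rb") as f:
        book2 = pickle.load(f)

print(book2 == book)  # True: same contents
print(book2 is book)  # False: a different object
```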
Serializing data in memory with pickle
- If we don't want to use a file, we can still serialize an object in memory.
- The pickle.dumps() function (note the s at the end of the function name; this is not dump()) performs the same serialization as the pickle.dump() function. Instead of taking a stream object and writing the serialized data to a file on disk, it simply returns the serialized data.
- Since the pickle protocol uses a binary data format, the pickle.dumps() function returns a bytes object.
- The pickle.loads() function (again, note the s at the end of the function name) performs the same deserialization as the pickle.load() function. Instead of taking a stream object and reading the serialized data from a file, it takes a bytes object containing serialized data, such as the one returned by the pickle.dumps() function.
- The end result is the same: a perfect replica of the original dictionary.
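- The in-memory variant can be sketched like this. The dictionary is illustrative, and the variable names blob and book3 are made up for the example.

```python
import pickle

book = {"title": "An Example Book", "page_count": 320, "tags": ("python", "pickle")}

# dumps() returns the serialized data instead of writing it to a stream.
blob = pickle.dumps(book)
print(type(blob).__name__)  # bytes

# loads() takes that bytes object and rebuilds a new Python object from it.
book3 = pickle.loads(blob)
print(book3 == book)  # True
print(book3 is book)  # False
```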
Python serialized object and JSON
- The data format used by the pickle module is Python-specific. It makes no attempt to be compatible with other programming languages. If cross-language compatibility is one of our requirements, we need to look at other serialization formats. One such format is json.
- JSON (JavaScript Object Notation) is a text-based open standard designed for human-readable data interchange. It is derived from the JavaScript scripting language and represents simple data structures and associative arrays, which it calls objects.
- Despite its relationship with JavaScript, it is language-independent, with parsers available for many languages. json is explicitly designed to be usable across multiple programming languages.
- The JSON format is often used for serializing and transmitting structured data over a network connection.
- It is used primarily to transmit data between a server and a web application, serving as an alternative to XML.
- Python 3 includes a json module in the standard library. Like the pickle module, the json module has functions for serializing data structures, storing the serialized data on disk, loading serialized data from disk, and unserializing the data back into a new Python object. But there are some important differences, too.
- The json data format is text-based, not binary. All json values are case-sensitive.
- As with any text-based format, there is the issue of whitespace. json allows arbitrary amounts of whitespace (spaces, tabs, carriage returns, and line feeds) between values.
- This whitespace is insignificant, which means that json encoders can add as much or as little whitespace as they like, and json decoders are required to ignore the whitespace between values.
- This allows us to pretty-print our json data, nicely nesting values within values at different indentation levels so we can read it in a standard browser or text editor. Python's json module has options for pretty-printing during encoding.
- There's the perennial problem of character encoding. json encodes values as plain text, but as we know, there is no such thing as plain text. json must be stored in a Unicode encoding (UTF-32, UTF-16, or the default, UTF-8).
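- Two of the points above, insignificant whitespace and character encoding, can be checked with a short sketch. The sample strings are invented for the example; ensure_ascii is a real parameter of json.dumps() that controls whether non-ASCII characters are escaped.

```python
import json

# The same document written compactly and with generous whitespace.
compact = '{"title":"Café","tags":["a","b"]}'
spaced = """
{
    "title" : "Café",
    "tags"  : [ "a",
                "b" ]
}
"""

# Decoders ignore whitespace between values, so both parse to the same object.
print(json.loads(compact) == json.loads(spaced))  # True

# By default dumps() escapes non-ASCII characters, so the output is pure ASCII.
escaped = json.dumps({"title": "Café"})
print(escaped)  # {"title": "Caf\u00e9"}

# With ensure_ascii=False the character is kept as-is, which is why the
# file it is written to needs an explicit Unicode encoding such as utf-8.
literal = json.dumps({"title": "Café"}, ensure_ascii=False)
print(literal)  # {"title": "Café"}
```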
Saving data to JSON
- We're going to create a new data structure instead of re-using the existing one. json is a text-based format, which means we need to open this file in text mode and specify a character encoding. We can never go wrong with utf-8.
- Like the pickle module, the json module defines a dump() function which takes a Python data structure and a writable stream object. The dump() function serializes the Python data structure and writes it to the stream object. Doing this inside a with statement will ensure that the file is closed properly when we're done.
- Let's see what's in the ebook.json file:
- It's clearly more readable than a pickle file. But json can contain arbitrary whitespace between values, and the json module provides an easy way to take advantage of this to create even more readable json files:
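- We passed an indent parameter to the json.dump() function, and it made the resulting json file more readable, at the expense of larger file size. The indent parameter is a non-negative integer giving the number of spaces per nesting level (a string such as a tab character is also accepted).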
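- A sketch of both calls, compact and pretty-printed. The ebook dictionary is hypothetical, and the file is written to a temporary directory so the example cleans up after itself.

```python
import json
import os
import tempfile

# Hypothetical record; only JSON-friendly types are used here.
ebook = {
    "title": "An Example Book",
    "page_count": 320,
    "in_print": True,
    "tags": ["python", "json"],
    "subtitle": None,
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ebook.json")

    # Text mode with an explicit encoding, since json is a text format.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(ebook, f)
    with open(path, encoding="utf-8") as f:
        compact_text = f.read()

    # Same data, pretty-printed with two spaces per nesting level.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(ebook, f, indent=2)
    with open(path, encoding="utf-8") as f:
        pretty_text = f.read()

print(compact_text)
print(pretty_text)
print(len(compact_text) < len(pretty_text))  # True: readability costs bytes
```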
Data type mapping
- There are some mismatches in JSON's coverage of Python datatypes. Some of them are simply naming differences, but there are two important Python datatypes that are completely missing: tuples and bytes.
Python 3 | JSON |
---|---|
dictionary | object |
list | array |
tuple | N/A |
bytes | N/A |
float | real number |
True | true |
False | false |
None | null |
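- The mapping can be observed directly, and the bytes gap can be worked around with the encoding and decoding hooks the json module offers. In the sketch below, default and object_hook are real json parameters; the helper names encode_bytes and decode_bytes and the "__bytes__" marker key are hypothetical conventions invented for this example.

```python
import json

# Naming differences: True/None become true/null in the output text.
print(json.dumps({"flag": True, "nothing": None, "ratio": 1.5}))
# {"flag": true, "nothing": null, "ratio": 1.5}

# A tuple has no JSON counterpart, so it is written as an array.
print(json.dumps(("a", "b")))  # ["a", "b"]

# bytes are rejected outright.
try:
    json.dumps({"raw": b"\x00\x01"})
    bytes_supported = True
except TypeError:
    bytes_supported = False
print(bytes_supported)  # False


def encode_bytes(obj):
    """Encoding hook: called only for objects json cannot serialize itself."""
    if isinstance(obj, bytes):
        return {"__bytes__": list(obj)}
    raise TypeError(f"{type(obj).__name__} is not JSON serializable")


def decode_bytes(dct):
    """Decoding hook: called for every decoded JSON object (dict)."""
    if "__bytes__" in dct:
        return bytes(dct["__bytes__"])
    return dct


encoded = json.dumps({"raw": b"\x00\x01"}, default=encode_bytes)
decoded = json.loads(encoded, object_hook=decode_bytes)
print(encoded)  # {"raw": {"__bytes__": [0, 1]}}
print(decoded)  # {'raw': b'\x00\x01'}
```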
Loading data from a JSON file
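- A minimal sketch of reading a JSON file back with json.load(), the counterpart of json.dump(). The ebook dictionary and the temporary file location are assumptions made for the example. Unlike the pickle round trip, the result here is not equal to the original, because the tuple comes back as a list.

```python
import json
import os
import tempfile

# Hypothetical record that deliberately contains a tuple.
ebook = {"title": "An Example Book", "page_count": 320, "tags": ("python", "json")}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ebook.json")

    with open(path, "w", encoding="utf-8") as f:
        json.dump(ebook, f)

    # load() takes a readable text stream and returns a new Python object.
    with open(path, "r", encoding="utf-8") as f:
        ebook2 = json.load(f)

print(ebook2["tags"])                          # ['python', 'json']
print(ebook2 == ebook)                         # False: list vs. tuple
print(ebook2["tags"] == list(ebook["tags"]))   # True
```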
List to JSON file
- The following code makes a list of dictionary items and then saves it to json. The input used in the code is semicolon-separated with three columns, like this:
- Before turning it into a list of dictionary items, we add an additional info field, 'value':
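- One possible way to do this is sketched below. The sample rows, the three column names (name, code, group), and the placeholder used for 'value' are all hypothetical, since the original input is not shown; only the semicolon separator, the three columns, and the extra 'value' field come from the text.

```python
import csv
import io
import json
import os
import tempfile

# Hypothetical semicolon-separated input with three columns.
raw = "alpha;1;x\nbeta;2;y\n"
fieldnames = ["name", "code", "group"]  # assumed column names

items = []
reader = csv.DictReader(io.StringIO(raw), fieldnames=fieldnames, delimiter=";")
for row in reader:
    item = dict(row)
    item["value"] = 0  # the additional info field, with a placeholder value
    items.append(item)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "items.json")

    # Save the list of dictionaries as a JSON array of objects.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(items, f, indent=2)

    # Read it back to confirm the file holds the same list.
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded)
```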